@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to:

  1. Fix the error message shown when the declared output schema does not match the R DataFrame returned by the given function. For example,

    df <- createDataFrame(list(list(a=1L, b="2")))
    count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))

    Before:

    Error in handleErrors(returnStatus, conn) :
      ... 
      java.lang.UnsupportedOperationException
        ...
    

    After:

    Error in handleErrors(returnStatus, conn) :
     ...
     java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
        ...
    
  2. Update documentation about the schema matching for gapply and dapply.
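
A minimal, self-contained sketch of how a message of this shape can be assembled, for readers without a Spark build at hand. The `DataType` trait, its two case objects, and the `checkSchema` helper are hypothetical stand-ins introduced only for illustration; they are not Spark's classes and not the code in this PR. Scala's `assert` prefixes its message with `assertion failed: `, which is where that part of the "After" output comes from.

```scala
// Hypothetical stand-ins for Spark's DataType objects, so the sketch runs without Spark.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType

object SchemaCheckSketch {
  // Assumed helper (not part of the PR): compares the declared and actual column
  // types and fails with a message that lists both sides.
  def checkSchema(fn: String, expected: Seq[DataType], actual: Seq[DataType]): Unit =
    assert(expected == actual, s"Invalid schema from $fn(): " +
      s"expected ${expected.mkString(", ")}, got ${actual.mkString(", ")}")

  def main(args: Array[String]): Unit = {
    val declared = Seq(IntegerType, IntegerType) // corresponds to structType("a int, b int")
    val returned = Seq(IntegerType, StringType)  // corresponds to columns a = 1L, b = "2"
    try checkSchema("gapply", declared, returned)
    catch { case e: AssertionError => println(e.getMessage) }
  }
}
```

Note that Scala assertions can be elided at compile time (for example with `-Xdisable-assertions`), so a sketch like this only reports the mismatch when assertions are enabled.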

Why are the changes needed?

To show which part of the schema does not match, so that users can see what went wrong.

Does this PR introduce any user-facing change?

Yes, the error message is updated as shown above, and the documentation is updated.

How was this patch tested?

Manually tested, and unit tests were added.

val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
assert(outputTypes == actualDataTypes, "Invalid schema from gapply(): " +
  s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
batch.rowIterator().asScala
Member Author

This is the same as dapply:

columnarBatchIter.flatMap { batch =>
  val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
  assert(outputTypes == actualDataTypes, "Invalid schema from dapply(): " +
    s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
  batch.rowIterator.asScala
}
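
Because `flatMap` on a Scala `Iterator` is lazy, a per-batch check like the one above only runs when the batch is actually pulled. A rough sketch of that behavior follows, assuming a simplified `Batch` case class and a `validatedRows` helper that are hypothetical and stand in for Spark's columnar batch and the surrounding operator code:

```scala
object LazyBatchCheckSketch {
  // Hypothetical, simplified stand-in for a columnar batch: column type names plus rows.
  final case class Batch(columnTypes: Seq[String], rows: Seq[Seq[Any]])

  // Assumed helper (not Spark's code): validates each batch as it is consumed,
  // then yields that batch's rows.
  def validatedRows(batches: Iterator[Batch], outputTypes: Seq[String]): Iterator[Seq[Any]] =
    batches.flatMap { batch =>
      assert(outputTypes == batch.columnTypes, "Invalid schema from dapply(): " +
        s"expected ${outputTypes.mkString(", ")}, got ${batch.columnTypes.mkString(", ")}")
      batch.rows.iterator
    }

  def main(args: Array[String]): Unit = {
    val good = Batch(Seq("IntegerType", "IntegerType"), Seq(Seq(1, 2)))
    val bad = Batch(Seq("IntegerType", "StringType"), Seq(Seq(1, "2")))
    // Building the iterator does not run any check yet.
    val rows = validatedRows(Iterator(good, bad), Seq("IntegerType", "IntegerType"))
    println(rows.next()) // the first batch matches, so its row is returned
    // Pulling from the second batch triggers the assertion.
    try rows.next()
    catch { case e: AssertionError => println(e.getMessage) }
  }
}
```

This is consistent with the example in the description wrapping `gapply` in `count(...)`: the mismatch is reported once something forces the rows to be consumed.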

@HyukjinKwon
Member Author

@viirya can you take a quick look when you're available?

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126766 has finished for PR 29283 at commit 8ed454a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2020

Test build #126767 has finished for PR 29283 at commit ab5ecde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya viirya left a comment

The improved error message looks nice. Just one minor comment about the doc.

@viirya
Member

viirya commented Jul 30, 2020

LGTM

@HyukjinKwon
Member Author

Merged to master and branch-3.0. Thanks @viirya.

HyukjinKwon added a commit that referenced this pull request Jul 30, 2020
…pply with Arrow vectorization

Closes #29283 from HyukjinKwon/r-vectorized-error.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@SparkQA

SparkQA commented Jul 30, 2020

Test build #126798 has finished for PR 29283 at commit 33d0cec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the r-vectorized-error branch December 7, 2020 02:06